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Abstract: This study focuses on the performance of two classical dense linear algebra algo- 
rithms, the LU and the QR factorizations, on multilevel hierarchical platforms. We first introduce 
a new model called Hierarchical Cluster Platform (HCP), encapsulating the characteristics of such 
platforms. The focus is set on reducing the communication requirements of studied algorithms at 
each level of the hierarchy. Lower bounds on communications are therefore extended with re- 
spect to the HCP model. We then introduce multilevel LU and QR algorithms tailored for those 
platforms, and provide a detailed performance analysis. We also provide a set of numerical experi- 
ments and performance predictions demonstrating the need for such algorithms on large platforms. 
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Factorisations LU et QR multi-niveaux optimales en 
communication pour plates-formes hierarchiques 



Resume : Cette etude porte sur 1' analyse des performances de deux algorithmes 
classiques de l'algebre lineaire dense, les factorisations LU et QR, sur des plates- 
formes multi-niveaux hierarchiques. Nous presentons tout d'abord un nouveau 
modele analytique appele Hierarchical Cluster Platform (HCP), encapsulant les 
caracteristiques de ce type de plates-formes. Plus precisement, l'emphase est mise 
sur ce qui se passe a chaque niveau de la hierarchic Nous etendons des bornes 
inferieures sur les communications au modele HCP. Nous introduisons ensuite 
deux algorithmes multi-niveaux adaptes a ces plates-formes pour les factorisations 
LU et QR, et analysons leurs performances. Nous presentons en outre un ensemble 
d' experiences numeriques ainsi que des predictions de performances illustrant la 
necessite de tels algorithmes sur les plates-formes a grande echelle. 

Mots-cles : QR, LU, exascale, plates-formes hierarchiques 
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1 Introduction 

Due to the ubiquity of multicore processors, solvers should be adapted to better 
exploit the hierarchical structure of modern architectures, where the tendency is 
towards multiple levels of parallelism. Thus with the increasing complexity of 
nodes, it is important to exploit these many levels of parallelism even within a sin- 
gle compute node. For that reason, classical algorithms need to be revisited so as 
to fit modern architectures that expose parallelism at different levels in the hierar- 
chy. We believe that such an approach is mandatory in order to exploit upcoming 
hierarchical exascale computers at their full potential. 

Studying the communication complexity of linear algebra operations and de- 
signing algorithms that are able to minimize communication is a topic that has 
received an important attention in the recent years. The most advanced approach 
in this context assumes one level of parallelism and takes into account the computa- 
tion, the volume of communication, and the number of messages exchanged along 
the critical path of a parallel program. In this framework, the main previous theo- 
retical result on communication complexity is a result derived by Hong and Kung 
in the 80's providing lower bounds on the volume of communication of dense ma- 
trix multiplication for sequential machines lil4l . This result has been extended to 
parallel machines IfTBTl . to dense LU and QR factorizations (under certain assump- 
tions) [7 ], and then to basically all direct methods in linear algebra |4|. Given an 
algorithm that performs a certain number of floating point operations, and given a 
size of the memory M, the lower bounds on communication are obtained by us- 
ing the Loomis-Whitney inequality, as for example in lfT5l l4l. While theoretically 
important, these lower bounds are derived with respect to a simple performance 
model that supposes a memory hierarchy in the sequential case, and P processors 
without memory hierarchy in the parallel case. Such a model is not sufficient to 
encapsulate the features of modern architectures with multilevel hierarchy. 

On the practical side, several algorithms have been introduced recently [3j [TTJ 
13 [lOl . Most of them propose to use multiple reduction trees depending on the 
hierarchy. However, the focus is set on reducing the running time without explic- 
itly taking communication into consideration. In ifTOll . Dongarra et al. propose 
a generic algorithm implementing several optimizations regarding pipelining of 
computation, and allowing to select different elimination trees on platforms with 
two levels of parallelism. They provide insights on choosing the appropriate tree, a 
binary tree being for instance more suitable for a cluster with many cores, while a 
flat tree allows more locality and CPU efficiency. However, no theoretical bounds 
nor cost analysis are provided in these studies regarding communication. 

In the first part of this paper we introduce a new model that we refer to as the 
HCP model. Provided that two supercomputers might have different communica- 
tion topologies and different compute nodes with different memory hierarchies, a 
detailed performance model tailored for one particular supercomputer is likely to 
not reflect the architecture of another supercomputer. Hence the goal of our per- 
formance model is to capture the main characteristics that influence the commu- 
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nication cost of peta- and exa- scale supercomputers which are based on multiple 
levels of parallelism and memory hierarchy. We use the proposed HCP model to 
extend the existing lower bounds on communication for direct linear algebra, to 
account for the hierarchical and heterogeneous nature of present-day computers. 
We determine what is the minimum amount of communication that is necessary at 
every level in the hierarchy, in terms of both number of messages and volume of 
communication. 

In the second part of the paper we introduce two multilevel algorithms for 
computing LU and QR factorizations (ML-CAQR and ML-CALU) that are able 
to minimize the communication at each level of the hierarchy, while performing a 
reasonable amount of extra computation. These recursive algorithms rely on their 
corresponding 1-level algorithms (resp. CAQR and CALU) as their base case. 
Indeed, these algorithms are known to attain the communication lower bounds in 
terms of both bandwidth, and latency with respect to the simpler one level perfor- 
mance model. 

2 Toward a realistic Hierarchical Cluster Platform model 
(HCP) 

The focus is set on hierarchical platforms implementing deeper and deeper hier- 
archies. Typically, these platforms are composed of two kinds of hierarchies: (1) 
a network hierarchy composed of interconnected network nodes, which is stacked 
on top of a (2) computing nodes hierarchy [6]. This compute hierarchy can be 
composed for instance of a shared memory NUMA multicore platform. 

Level i + 1 CJVi+i 




Figure 1: Components of a level i in the HCP model. 

Our HCP model considers such platforms with I levels of parallelism, and uses 
the following assumptions. Level 1 is the deepest level in the hierarchy, where 
actual processing elements are located (for example cores). An intermediate level 
i > 1 and its components, as depicted in Figure [T] have the following characteris- 
tics. A compute node of level i + 1, denoted as CiVj+i in the figure, is formed by 
Pi compute nodes of level i (two nodes in our example). These Pi compute nodes 
are organized along a 2D grid topology, that is Pi = P n x P c . . The total number 
of compute nodes of the entire platform is P = n!=i Pi* while the total number of 
compute nodes of level i is denoted P* = ]J l j=i Pj. We let Mi = M x - J]j~l Pj 
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be the aggregated memory size of a node of level i > 1, where M\ denotes the 
memory size of a processing element of level 1. The network latency a% and the 
inverse bandwidth apply throughout an entire level i, however the higher in the 
hierarchy, the more important communication costs become. We also consider a 
network buffer Bi at each level of the hierarchy. This allows to take into account 
the possibility of message aggregation for network communication, and determine 
the number of messages required to send a given amount of data, thus the latency 
cost associated with each communication at every level of the hierarchy. These 
notations will be used throughout the rest of the paper. In addition, we refer to the 
number of messages sent by a node of level i as Si, and to the exchanged volume 
of data as W%. Si is the latency cost at level i, Si = Si ■ on. Similarly, W% is the 
bandwidth cost, Wi = Wi ■ fa. 

We note that the model makes abstraction of the detailed architecture of a 
compute node or the interconnection topology at a given level of the hierarchy. 
Hence such an approach has its own limitations, since the performance predicted 
by such a model might not be extremely accurate. However, while keeping the 
model tractable, this model better reflects the actual nature of supercomputers than 
the one level model assumed so far, and it will also allow us to understand the 
communication bottlenecks of linear algebra operations. 

Communicating under the HCP model. We now describe how communica- 
tion happens in the HCP model, and how messages are routed between different 
levels of the network hierarchy. First, we assume that if a compute node of level 
i communicates, all the nodes below it participates. We denote as counterparts of 
a compute node of level i all the nodes of level i lying in remote compute nodes 
of level i + 1 having the same local coordinates. We therefore have the relation 
Wi = W i+1 /Pi. 

As an example, let us detail a communication taking place between two com- 
pute nodes of a given level i. A total of P/P* processing elements of level 1 are 
involved in sending a global volume of data Each of these elements has to send 
a chunk of data W\ = WiP/ P*. Since this amount of data has to fit in the memory 
of a processing element of level 1, we obviously have Vi, M\ > W\ = WiP/P*. 
These blocks are transmitted to the level above in the hierarchy, i.e. to level 2. A 
compute node of level 2 has to send a volume of data W2 = Pi W\. Since the net- 
work buffer size at level 2 is B>2, this requires (W2/-B2) messages. The same holds 
for any level k such that 1 < k < i, where data is forwarded by sending (Wk/B^) 
messages. We therefore have the following costs: 

w w i p k a c W k WiP* 

W k = -p^ ■ p k , S k = -g£ • a k = ■ a k . 

Network types. We assume three kinds of networks, depending on their respec- 
tive buffer size: 
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1 . fully-pipelined networks, able to aggregate all incoming messages into a sin- 
gle message. This requires that Bi > Mi, since at most Mi words of data 
can be sent. 

The fully-pipelined network case is ensured whenever Bi > Pj_iWi_i. 
Since Mi is the size of the largest message sent at level i, we assume Bi = 
Mi . We also assume that all levels below a fully-pipelined level are them- 
selves fully-pipelined. Therefore, the constraint on the buffer size becomes 
Bi = M t = Pi-iB,^. 

2. bujferized networks, allowing for message aggregation up to a buffer size 
Bi < Mi. 

3. forward networks, where messages coming from lower level are simply for- 
warded to higher levels. 

For a given leveH, a. forward network requires that Bi = Bi-\. Indeed, when 
all the sub-nodes from level i — 1 send Si-\ messages each, the number of 
forwarded messages is Si = Pi-\Si-\. 
Based on the two extreme cases, we assume the buffer size Bi to satisfy B^i < 

Bi < Pi-iBi-i. 



Lower bounds on communication. Lower bounds on communication have 
been generalized in [4| for direct methods of linear algebra algorithms which can 
be expressed as three nested loops. We refine these lower bounds under our hier- 
archical model. For matrix product-like problems, at least one copy of the input 
matrix has to be stored in memory: a compute node of level i thus needs a mem- 
ory of Mi = U(n 2 /P*). Furthermore, the lower bound on latency depends on the 
buffer size Bi of the considered level i, where a volume Wj needs to be sent in 
messages of size Bi. Hence the lower bounds on communications at level i: 

Wi = n wW' p ) (1) Si = n \B^pj' a ') (2) 

Note that, for simplicity, we expressed the bound on latency with respect to Bi 
for all level i. Since we consider B\ = M\, the lower bound on latency for level 1 
can also be expressed as S\ = Q WP) ■ 



3 Multilevel algorithms 

In this section, we introduce ML-CAQR and ML-CALU, two multilevel algo- 
rithms for computing the QR and the LU factorizations of a dense matrix A. These 
multilevel algorithms heavily rely on their relative 1 -level communication optimal 
algorithms (C AQR and CALU), and can be seen as a recursive version of these al- 
gorithms. ML-CAQR and ML-CALU recursive layout naturally allows for local 
elimination trees tailored for hierarchical platforms, thus reducing the communica- 
tion needs at each level of the hierarchy. 
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3.1 Multilevel QR factorization 

The QR factorization is a widely used algorithm. Example of use are numerous, 
be it for orthogonalizing a set of vectors or for solving least squares problems. It 
decomposes a matrix A into two matrices Q and R such that A = QR, where Q 
is orthogonal and R is upper triangular. The decomposition is obtained by using 
very stable transformations, such as Householder reflections. We assume in the 
following that the matrix Q is not stored explicitly, but rather using the compact 
YTY T representation |[T6l . 

ML-CAQR, given in Algorithm [T] is a multilevel tree-based algorithm com- 
puting the QR factorization of a matrix using Householder reflections. It is tailored 
for hierarchical platforms. Moreover, as ML-CAQR is a tree-based algorithm, 
these Householder reflectors are stored in the lower triangular part of matrix A 
using a tree structure as in Q. 

As CAQR on platforms with one level of parallelism, ML-CAQR aims at 
reducing the amount of communication required during the factorization, but at 
each level of parallelism of a hierarchical platform. At the topmost level of the 
hierarchy, ML-CAQR processes the entire input matrix A panel by panel. Each 
panel is first factored by a recursive call to ML-CAQR on the next lower level. 
The Householder reflectors are then sent to remote compute nodes and the trailing 
matrix updated using two recursive routines: ML-UPFACT and ML-UPELIM. 

(r) 

More precisely, for each recursion level r, let b r be the block size, m s = 
(iris /P rr+1 — (s— l)b T ) be the panel row count at step s, and ni = {b r+ i—sb T ) 
be the number of columns in the trailing matrix. At the topmost level I, we have 
rris = (m — (s — 1)6;) and nip = (n — sty). ML-CAQR proceeds as follows: 

1. The panel is factored by using a reduction operation, where ML-CAQR is 

the reduction operator. With a binary tree, the computation becomes: 

(r) 

(a) First P Tr subsets of the panel, of size m s / P Tr -by-b r , are recursively 
factored with ML-CAQR (with a block size 6 r _i corresponding to the 
next level deeper in the hierarchy). At the deepest level of recursion, 
ML-CAQR calls the CAQR algorithm. 

(b) The resulting b r -by-b r R factors are eliminated two-by-two using an 
elimination tree by multiple calls to ML-CAQR, requiring logP rr 
steps along the critical path. 

2. The current trailing matrix is then updated to reflect the factorization of the 
panel: 

(a) Updates corresponding to factorizations at the leaves of the tree are ap- 
plied using the ML-UPFACT routine. This routine is called in parallel 
on P Tr blocks rows of size m s V-Pr r -by-ni r) . ML-UPFACT broad- 

(r) 

casts P Tr blocks of Householder's reflectors of size m s / P rr -by-b r 
from the column of nodes holding current panel along rows of com- 
pute nodes. At the deepest level, the update corresponding to a leaf is 
applied as in CAQR (see (HI). 
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Algorithm 1: ML-CAQR(A, m, n, r, P) 

if r = 1 then 

|_ Call CAQR(A, P) 

else 

for kk = 1 to n/b r do 

for Processor p = 1 to P r „ in parallel do 

hp=(m- {kk - l)b r )/P rr 

panel = A(kk ■ b r + (p — l)h p : kk ■ b r + p/» p , kk ■ b r : (kk + 1) • 6 r ) 
Call ML-CAQR(pcroeZ, h p , b T ,r - l.p) 
for j = 1 to log P rr do 

(PsourccPtarget) is the pair of processors with which this elimination is performed. 
Send local b r -by-b r to the remote processor pt arge t 
Stack two b r -by-b r upper triangular matrices in RR 
Call ML-CAQR(iJiJ, 26 r , 6 r , r - l,p source ) 
Call ML-CAQR(i?iJ, 2b r , b r , r - l,p ta rget) 

for Processor p = 1 to P, v in parallel do 

Broadcast the sets of Householder vectors to every processor belonging to the same 
processor row 

for Processor pp = 2 to P c , on same row than p in parallel do 
L Call ML-UPFACT(r - l,pp) 

for j = 1 to log P Tr do 

(PsouTcoPtarget) is the pair of processors with which this elimination was performed. 

for Processor pp = 2 to P €r on same row than p source in parallel do 

PPtarget is the remote processor on the same row than pt aT get and in the same col- 
umn than pp 

pp sends its local C to pp ta rget 
Call ML-UPELIM(r - l,pp) 
[_ Call ML-UPELIM(r - l,pp ta rget) 
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(b) Finally, the updates due to the eliminations of the intermediate R fac- 
tors are applied onto the trailing matrix using the ML-UPELIM pro- 

(VI 

cedure. Blocks of size b r -by-ris / P Cr are exchanged within a pair of 
compute nodes. At the lowest level, a partial update is then computed 
locally before being applied independently onto each processing ele- 
ments, similarly to CAQR 

3.2 Multilevel LU factorizations 

LU factorization is the cornerstone of many scientific computations, and is the 
method of choice for solving most linear systems. It consists in decomposing a 
matrix A into a lower triangular matrix L and an upper triangular matrix U such 
that PA = LU, where P is a permutation matrix required for numerical stability 
reasons. We present two variants of a multilevel algorithm, ML-CALU, for com- 
puting the LU factorization of a dense matrix. ML-CALU is a recursive algorithm. 
The first variant, 1D-ML-CALU is a uni-dimensional approach where the entire 
panel is processed by a single recursive call. The second variant, 2D-ML-CALU, 
processes a panel by multiple recursive calls on sub-panels followed by a "reduc- 
tion" phase similar to that of ML-CAQR. The base case of both recursive variants 
is CALU lfl3l , which uses tournament pivoting to select pivots. 

In the case of 1D-ML-CALU, the algorithm proceeds as follows at each re- 
cursion level r. (1) 1D-ML-CALU is recursively applied to an entire panel of b r 
columns, only the number of compute nodes along the columns varying. (2) Once 
a panel is factored, a block of rows of U is computed by P Cr x P* compute nodes 
of level r. (3) The trailing matrix is finally updated after a broadcast of the block 
column of L along rows of the process grid and the block row of U along columns 
of the process grid. The matrix-matrix operations are performed using a multilevel 
matrix product algorithm, ML-CANNON, which is based on the optimal Cannon 
algorithm. For more details, we refer the interested reader to Algorithm [2] in Ap- 
pendix [A] At the deepest level of recursion, panels of size m x b\ are factored by 
CALU using P r processing elements of level 1. Gaussian elimination with partial 
pivoting is first applied to blocks of size (m/ P r )-by-b\, located at the leaves of the 
reduction tree. These candidate pivot rows are then combined using tournament 
pivoting, which involves communications at every level in the hierarchy. Algo- 
rifhm[3]in Appendix [B] describes in details ID ML-CALU In terms of numerical 
stability, 1D-ML-CALU is equivalent to performing CALU on a matrix of size 
m x n, using a block size b\ and a grid of processors P = P r x P c . 

The 2D-ML-CALU algorithm was first introduced in ||9l and analyzed for 
two-levels platforms. Here we extend the analysis of Algorithm |4| given in Ap- 
pendix [Bj to deeper platforms. It proceeds as follows: (1) the panel is recursively 
factored with 2D-ML-CALU with a block size corresponding to the next level in 
the hierarchy. Note that at the deepest level of recursion, 2D-ML-CALU calls 
CALU. (2) The selected sets of pivot candidates are merged two-by-two along the 
reduction tree, where the reduction operator is 2D-ML-CALU. At the end of the 
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preprocessing step, the final set of pivot rows is selected and each node working 
on the panel has the pivot information and the diagonal block of U. (3) Then the 
computed permutation is applied to the input matrix, the block column of L and 
the block row of U are computed. (4) Finally, after the broadcast of L and U to 
appropriate nodes, as in 1D-ML-CALU, the trailing matrix is updated with ML- 
CANNON. 

4 Performance models 

In this section, we provide a cost analysis of both ML-CAQR and ML-CALU 
within the HCP model. We first analyse two recursive communication routines 
which are used by all multilevel algorithms presented in this study, namely the 
point to point communication and the broadcast operations. 

Point to point communication of a volume D of data between two compute 
nodes of level r involves compute nodes from level 1 to level r. At every level, 
subnodes send their local data to their counterparts in the remote node of level r. 
The associated communication costs are: 

Wr C omm(1 . . . r, D) = YJk=i T^Mfc , £rcomm(1 • • • r, D) = a\ + Y7k=2 §n^ a k ■ 

k k k 

The broadcast operation between P Cr compute nodes of level r is very similar 
to point to point communication, except that at every level, a node involved broad- 
casts its data to P Cr counterparts. A broadcast can thus be seen as log P Cr point to 
point communications. 

4.1 ML-CAQR 

We now review the cost of ML-CAQR in terms of computations as well as commu- 
nications. At each recursion level r, parameters are adapted to a grid of P rr -by-P Cr 
compute nodes. The current panel is first factored by doing P Tr parallel calls to 
ML-CAQR at the leaves of the tree. Then, the resulting R factors are eliminated 
through log P Tr successive factorizations of a 2b r -by-b r matrix formed by stacking 
up two upper triangular R factors. Once a panel is factored, the trailing matrix is 
updated. However, as the Householder's reflectors are stored in a tree structure, 
like in [8 ], the updates must be done by going through each level of the tree again. 
These operations are recursively performed using ML-UPFACT for the leaves and 
ML-UPELIM for higher levels. 

Global recursive cost of ML-CAQR. We define the following contributions to 
the global cost of ML-CAQR: Tcaqr (jn, n, b, P) is the cost of factoring a matrix 
of size m-by-n with CAQR using P processors and a block size b. 
Tml-caqr (m, n, b, P) is the cost of ML-CAQR on a m-by-n matrix using P 
processors and a block size b. Tml-upfact (tu, n, b, P) is the cost of updating 
the trailing matrix to reflect factorizations at the leaves of the elimination trees. 
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Tml-upelim {m, n, b, P) is the cost of applying updates corresponding to higher 
levels in the trees. In terms of communication, ML-UPFACT consists in broad- 
casting Householder reflectors along process rows, while ML-UPELIM corre- 
sponds to log P Tr point to point communications of trailing matrix blocks between 
pairs of nodes. Using these notations, the cost of ML-CAQR can be recursively 
expressed as: 



Tml-caqr {m. n, b r , P r ) = 



n/b. 



m- («-l)6, 

CAQR I p ,0r,0r-ll P r- 



+ tog Pr r ' r RC0 MM(l y ) 

+ logP Tr ■ Tml-caqr (2f> r , b T , 6,-1, P,-l) 
fm - (s - l)b r n - si),. 

+ J ML-UPFACT [ 



P r ,. 

I Pr r " ^ML-UPELIM 26, 



P Cr 

n — sb r 



6,-1, Pr-l 
K-l.Pr-1 



(4) 



Tcaqr (m,n,h,Pi) 



Bounding cost of ML-CAQR. ML-CAQR uses successive elimination trees 

at each recursion depth r, each completed in log P Tr steps. As successive trees from 

level I down to level r come from different recursive calls, they are sequentialized. 

Thus, the total number of calls at a given recursion depth r can be upper-bounded 

by N r = 2 l - r X\ l j=r \ogP rj . The global cost of ML-CAQR can therefore be 

expressed in terms of number of calls at each level of recursion, broken down 

between calls performed on leaves or higher level in the trees. Details on this cost is 

given in Appendix |c| Assuming that for each level k, we have P rk = P Ck = \J~P~k 

, and block sizes chosen to make the additional costs lower order terms, that is 
rl 



b k = 0(n/(y/P% ■ II =fe log Pj))> the cost of ML-CAQR is: 



-Fml-caqr (n, n) < 
Wml-caqr [n, n) < 



Sml-caqr («i n) < 



in- 



-0 



7 



p jMogPi + logfl+ti-TJlogPj 



l-n 1 



E 



l/fflogP, 



PlogP 
(/ - k) • n 2 



p;io g p ; 



'■\/p-n i °s :i - p ) +° x/p-ri 1 ^ 2 ^ 

J = l \ 3=1 



E 



= -{l-k)logP k + 



■logP| + 



Bk y/Pj log P, 

•log Pi + 



BiVPi^JPj ' "VP//P1^P 



(5) 



(6) 



(7) 



Altogether, though the recursive nature of ML-CAQR leads to at least three 
times more computations than the optimal algorithm, which is similar to other 
recursive approaches iTTTTl . it allows to reach the lower bound at all levels of the 
hierarchy up to polylogarithmic factors. Indeed, chosing appropriate block sizes 
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makes most of the extra computational costs lower order while maintaining the 
optimality in terms of communications. 



4.2 2D multilevel CALU 



In this section we only detail the cost of 2D-ML-CALU with respect to the HCP 
model. Thus, for simplicity, we refer to it as ML-CALU throughout the rest of the 
paper. We note that we use the same reasoning as ML-CAQR to derive the recur- 
sive cost of ML-CALU, and the same approach and approximations to estimate its 
total cost. Thus for a square n-by-n matrix and using I levels of recursion (I > 2) 
the cost of ML-CALU is: 



-Fml-calu {n, n) < 



Wml-calu (n, n) < 



•Sml-calu (n, n) < 



2n n n A 

3P + P\ og 2 P, + P 1 



in 



16 1287 



+ 



2-/P 



log ^n( i+ 2 iog p ' 



1 2 



/8 



IS) 



(<-» 



+i V 1(i+ 5 ) n( i +i io g p I )i' 

j=3 



■2JP 



iogi 3 iII( 1 + 2 lo e p '' 



/s, 2n „ i-k, (i-; 



(10) 



'(!+ j)n( i+ 5 io g p /) 



Equation [8] shows that ML-CALU performs more floating-point operations 
than CALU. This is because of the recursive calls during the panel factorization. 
Note that certain assumptions should be done regarding the hierarchical structure 
of the computational system in order to keep the extra flops as a low order term, 
and therefore asymptotically reach the lower bounds on computation. 

In terms of communications, we can conclude that ML-CALU attains the 
lower bounds derived in section ^ modulo a factor that depends on l 2 Y\ l j=2 1°§ 
at each level k of hierarchy. Thus it reduces the communication cost at each level 
of a hierarchical system. Note that in practice the number of levels is going to re- 
main small, while the number of processors will be large. We note that we do not 
give the detailed cost of 1D-ML-CALU here. However we would like to point that 
it attains the lower bounds derived under the HCP model in terms of bandwidth at 
each level of parallelism. However in terms of latency the lower bound is only met 
at the deepest level of parallelism. 



5 Experimental results 

5.1 Numerical stability of ML-CALU 

Since ML-CALU is based on recursive calls, its stability can be different from 
that of CALU. Our experiments show that up to three levels of parallelism ML- 
CALU exhibits a good stability, however further investigation is required if more 
than three levels of parallelism are used. We study both the stability of the LU 
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decomposition and of the linear solver, in terms of growth factor and three differ- 
ent backward errors: the normwise backward error, the componentwise backward 
error, and the relative error \\PA — LU\\/\\A\\. 



1 1^D_ 


1 


. . ■ ■ in i - - 


1 

I ■ ■ ■■■■ ■ ■ 


1 






■ j 1 
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T 


Growth Factor 


, || PA -LU || /||A M 


normwise backward error 


! componentwise backward error 



1 37 73 109 144 



Figure 2: Ratios of 3-level CALU's growth factor and backward errors to GEPP's. 

Figure [2] displays the values of the ratios of 3-level CALU's growth factor and 
backward errors to those of GEPP for 36 special matrices lfT3l . The tested matrices 
are of size 8192, using the following parameters: P r - 3 = 16, 63 = 64, P r , 2 = 4, 
62 = 32, P n = 4, and 61 = 8. We can see that nearly all ratios are between 0.002 
and 2.4 for all tested matrices. For the growth factors, the ratio is of order 1 in 69% 
of the cases. For the relative errors, the ratio is of order 1 in 47% of the cases. 

We note that in most cases, ML-CALU uses tournament pivoting to select piv- 
ots at each level of the recursion, which does not ensure that the element of maxi- 
mum magnitude in the column is used as pivot, neither at each level of the hierar- 
chy, nor at each step of the LU factorization, that is globally for the panel. For that 
reason we consider a threshold 77-, defined as the quotient of the pivot used at step 
k divided by the maximum value in column k. We observe that in practice the piv- 
ots used by recursive tournament pivoting are close to the elements of maximum 
magnitude in the respective columns for both binary tree based 2-level CALU and 
binary tree based 3-level CALU. For example, for binary tree based 3-level CALU, 
the selected pivot rows are equal to the elements of maximum magnitude in 63% 
of the cases, and for the rest of the cases the minimum threshold r m i n is larger than 
0.30. 

5.2 Performance predictions 

In this section, we present performance predictions on a sample exascale platform. 
Current petascale platforms already display a hierarchical nature which strongly 
impacts the performance of parallel applications. Exascale will dramatically am- 
plify this trend. We plan here to provide an insight on what could be observed on 
such platforms. 

We model the platform with respect to the HCP model, and use it to estimate 
the running times of our algorithms. As exascale platforms are not available yet, 
we base our sample exascale platform on the characteristics of NERSC Hopper (2l 
1], a petascale platform planned to reach the Exascale around year 2018. It is 
composed of Compute Nodes, each with two hexacore AMD Opteron Magny-cours 
2.1GHz processors offering a peak performance of 8.4 GFlop/s, with 32 GB of 
memory. Nodes are connected in pairs to Gemini ASICs, which are interconnected 
through the Gemini network |[T8l[T2l . Detailed parameters of the Hopper platform 
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are presented in Table [T] 

The target exascale platform is obtained by increasing the number of nodes at 
all 3 levels, leading to a total of 1M nodes. The amount of memory per processing 
element is kept constant at 1.3 GB, while latencies and bandwidths are derived 
using an average 15% decrease per year for the latency and a 26% increase for the 
bandwidth H2j[T8l. These parameters are detailed in Table [TJ 





NERSC Hopper 


Exascale platform 


Level 


Type 


# 


Bandwidth 


Latency 


Type 


# 


Bandwidth 


Latency 


1 


2x 6-cores Opterons 


12 


19.8 GB/s 


1 x 10~ 9 s 


Multi-cores 


1024 


300 GB/s 


1 x 10 _1 "s 


2 


Hopper nodes 


2 


10.4 GB/s 


1 x 10" 6 s 


Nodes 


32 


150 GB/s 


1 x 10~ 7 s 


3 


Gemini ASICS 


9350 


3.5 GB/s 


1.5 x 10~ 6 s 


Interconnects 


32768 


50 GB/s 


1.5 x 10~ 7 s 



Table 1: Characteristics of NERSC Hopper and sample exascale platform. 



Moreover, in order to assess the performance of multilevel algorithms, costs of 
state-of-the-art 1 -level communication avoiding algorithms need to be expressed 
in the HCP model. To this end, we assume (1) each communication to go through 
the entire hierarchy: two communicating nodes thus belong to two distant nodes of 
level I, hence a bandwidth fy. (2) Bandwidth is shared among parallel communi- 
cations. 

We evaluate the performance of the ML-C AQR and ML-C ALU algorithms as 
well as their corresponding 1 -level routines on a matrix of size n x n, distributed 
over a square 2D grid of P^ processors at each level k of the hierarchy, P^ = 
\J~P~k x \fP~k. In the following, we assume all levels to be fully '-pipelined. Similar 
results are obtained regarding forward hierarchies, which is explained by the fact 
that realistic test cases are not latency bounded. 




(a) 1-level CAQR (b) ML-CAQR 

Figure 3: Prediction of communication to computation ratio on an exascale platform. 



Performance predictions of ML-CAQR The larger the platform is, the more 
expensive the communications become. This trend can be illustrated by observing 
the communication to computation ratio, or CCR of an algorithm. On Figure[3j we 
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plot the CCR of both CAQR and ML-CAQR on the exascale platform. The shaded 
areas correspond to unrealistic cases where there are more processing elements 
than matrix elements. As the number of processing elements increases, cost of 
CAQR (on Figure 3a I is dominated by communication. Our multilevel approach 
alleviates this trend, and ML-CAQR (on Figure 3b I allows to maintain a good 
computational density, especially when the number of levels involved is large. Note 
that for I = 1, ML-CAQR and CAQR are equivalent. 



Figure 4: Speedup of ML-CAQR vs. Figure 5: Speedup of ML-CALU vs. 
CAQR CALU 

However, as ML-CAQR perform more computations than CAQR, we com- 
pare the expected running times of both algorithms. Here by running time, we de- 
note the sum between computational and communication costs. We thus assume no 
overlap between computations and communications. The ratio of the ML-CAQR 
running time over CAQR is depicted on Figure|4] ML-CAQR clearly outperforms 
CAQR when using the entire platform, despite its higher computational costs. As 
a matter of a fact in this domain, the running time is dominated by the bandwidth 
cost, and ML-CAQR significantly reduces it at all levels. 



Performance predictions of ML-CALU The same observations can be made 
on the CCR of CALU and ML-CALU, we will therefore not present the detail 
here. Regarding the running times ratio, depicted on Figure [5] we can also con- 
clude that ML-CALU is able to keep communication costs significantly lower 
than CALU, leading to significant performance improvements. 

Altogether, our performance predictions validate our multilevel approach for 
large scale hierarchical platforms that will arise with the Exascale. Indeed, by tak- 
ing communications into consideration at all levels, ML-CAQR and ML-CALU 
deliver a high level of performance, even at scales where performance is hindered 
by communication costs with 1 -level communication avoiding algorithms. 
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6 Conclusion 

In this paper we have introduced two algorithms, ML-CALU and ML-CAQR, that 
minimize communication over multiple levels of parallelism at the cost of perform- 
ing redundant computation. The complexity analysis is performed by using HCP, 
a model that takes into account the cost of communication at each level of a hier- 
archical platform. The multilevel QR algorithm has similar stability properties to 
classic algorithms. Two variants of the multilevel LU factorization are discussed. 
A first variant, based on a uni-dimensional recursive approach, has the same sta- 
bility as CALU. However, while it minimizes bandwidth over multiple levels of 
parallelism, it allows to minimize latency only over one level of parallelism. The 
second variant which uses a two-dimensional recursive approach, is shown to be 
stable in practice, and reduces both bandwidth and latency over multiple levels of 
parallelism. 

Our performance predictions on a model of an exascale platform show that for 
strong scaling, the multilevel algorithms lead to important speedups compared to 
the algorithms which minimize communication over only one level of parallelism. 
In most of the cases, minimizing bandwidth is the key factor for improving scal- 
ability, and hence the ID ML-CALU is an appropriate choice for both ensuring 
numerical stability and being efficient in practice. 
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A Appendix: ML-CANNON 

To perform matrix multiplication C = C + A ■ B, we can use Cannon's algorithm. It is 
known that (see for example flU) assuming minimal memory usage, Cannon's algorithm 
is asymptotically optimal in terms of both bandwidth, and latency, that is the volume of 
data and the number of messages send by each node are asymptotically optimal. In the 
following we present a recursive Cannon algorithm over a hierarchy of nodes. We refer 
to this algorithm as ML-CANNON, and we consider I levels of parallelism. Algorithm [2] 
presents the communication details of ML-CANNON. 



Algorithm 2: ML-CANNON (C, A, B, P k , k) 

Input: three square matrices C, A, and B, k the number of recursion level, the number of 
processors at level k 
Output: C = C + AB 
if k == 1 then 

|_ CANNON(C, A, B, Pi) 

else 

for i = 1 to y r P~k in parallel do 

|^ left-circular-shift row i of A by i 

for i = 1 to \/T\ in parallel do 

|^ up-circular-shift column i of B by i 

for h = 1 to \/Pk (sequential) do 

for % = 1 to y/T\ and j = 1 to \JT\ in parallel do 

left-circular- shift row i of A by 1 
up-circular-shift column i of B by 1 



B Appendix: 1D-ML-CALU and 2D-ML-CALU 

ID multilevel CALU. ID ML-CALU is described in Algorithm^ It is applied to a 
matrix A of size m x n, using / levels of recursion and a total number of P = Y\.%=i Pi 
compute nodes of level 1 at the topmost level of the recursion. These compute nodes are 
organized as a two-dimensional grid at each level, Pi = P r . x P c . . 
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Algorithm 3: ID multilevel communication avoiding LU factorization 

Input: m X n matrix A, the recursion level I, block size 6;, the total number of compute units 

if l == 1 then 

|_ pi,Li,I/i] = CALU(A,6i,P r xP cl ) 

else 

M = m/bi,N = n/bi 
for K = 1 to N do 

[ilifif , Lk-.m,k > Ukk ] = Recursive — CALXJ(A K . M K ,l — 1, 

/* Apply permutation and compute block row of [/ */ 

for each compute unit at level I owning a block Ak.j, J = K + 1 to N in parallel do 

Uk,j = Lj^ K A KJ 

I* call multilevel dtrsm using P Cl X Yli=i Pi processing node of level 1 */ 
/* Update the trailing submatrix */ 

for each compute unit at level I owning a block Aj j of the trailing submatrix, 
J, J = K + 1 to M, N in parallel do 

= -4/, j — L I:K Uk,.j 
I* call multilevel dgemm using P r X fl'=i Pc t processing node of level 1 */ 
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2D multilevel CALU. The 2D ML-CALU algorithm, given in Algorithm |] details 
the communication performed during the factorization. 



Algorithm 4: 2D multilevel communication avoiding LU factorization 

Input: m X n matrix A, level of parallelism ( in the hierarchy, block size number of nodes 
Pi = P n x P CI 
if j = 1 then 
|_ Call CALU(A, m, n, l,Pi) 

else 

for k = 1 to n/bi do 

m p = (m - (fc - l)bi)/P ri 

n p = (n-{k-l)bi)/P c , 

I* factor leaves of the panel */ 

for Processor p = 1 to P r , in parallel do 

leaf = A((k-l)-bi + (p- l)m p + 1 : (fc - 1) • 6; + p- m p , (k- 1) • 6j + 1 : k ■ 6() 

Call ML-CALU(leaf,m p ,6 ; ,i - l,P;_i) 

/* Reduction steps */ 
for j = 1 to log Pr, do 

Stack two b ; -by-6; sets of candidate pivot rows in P 

Call ML-CALV(B,2bi,bi,l - l,Pj_i) 

/* Compute block column of L */ 

for Processor p = 1 to P r , in parallel do 

Compute Lp k = L(k ■ 6; + (p — l)m p + 1 : k ■ 6; + p ■ m p , [k — 1) • 6; + 1 : k ■ 6j) 
/* = ^p,fc ' ^kk) using multilevel algorithm with P;_i nodes at level (/ — 1) */ 

/* Apply all row permutations */ 

for Processor p = 1 to P r , in parallel do 

Broadcast pivot information along the rows of the process grid 
/* all to all reduce operation using P; processors of level / */ 
Swap rows at left and right 

Broadcast right diagonal block of L^ k along rows of the process grid 

/* Compute block row of U */ 

for Processor p = 1 to P c , in parallel do 

Compute Uk^p- U((k — 1) • 6; + 1 : k ■ 6;, k ■ 6; + (p — l)n p + 1 : k ■ b; + p ■ n p ) 
I* (t/fc p = \ ■ using multilevel algorithm with P;„i nodes at level (I — 1) */ 

/* Update trailing matrix */ 

for Processor p = 1 to P e , in parallel do 

| Broadcast Uk tP along the columns of the process grid 

for Processor p = 1 to P r , in parallel do 

I Broadcast L p ^ along the rows of the process grid 

for Processor p = 1 to P; in parallel do 

A(k -b[ + (p— l)m p + 1 : k ■ b[ + p ■ m p , k-bi + (p — l)n p + 1 : k ■ b; +p ■ n p ) = 
J 4(fc-b; + (p-l)m p + l : k-bi+p-rn p , k-b l + (p-l)n p + l : k-bi+p-n p )-L Pik -U kyP 

I* using multilevel Cannon with P; nodes at level I *l 
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C Appendix: detailed upper bound on the cost of ML- 
CAQR 

The global recursive cost of ML-C AQR can be upper bounded by: 

Tml-caqr (to, n) < 



t -Tml-caqr -7=,b 2 ,h,Pi 



+ Z)r=2 V ' ^rbcast(1 • • • r, -7=) + f- ■ Tml-UPFACT 



Z^r=2 h r 



7rcomm(1 • • • r, -f) + jf- • Tml-caqr 



m n 



2b r 



+7 1 RB cast(1 • ■ • r, b 2 r ) + f ■ Tmlupfact 



+T R0 QMM(l...r,-^) + fe-r 1 



n;; 

ML-UPELIM 



, b 2 ,h,Pi 



2b r 

n;=2 



+ 



n-Ni 



7rcomm(1 2") + ^ ■ Tml-caqr ( ff^^/p=' &2 ''' ■ P| 



+7rcomm(1 • • • 77^) + 5j • Tml-upelim ^ ^ftt 



2fc, 
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